About Dataset
The DARWIN dataset includes handwriting data from 174 participants. The classification task consists of distinguishing Alzheimer’s disease patients from healthy people.
Creator: Francesco Fontanella
Source: https://archive.ics.uci.edu/dataset/732/darwin
The DARWIN dataset was created to allow researchers to improve the existing machine-learning methodologies for the prediction of Alzheimer's disease via handwriting analysis.
Citation Requests/Acknowledgements
N. D. Cilia, C. De Stefano, F. Fontanella, A. S. Di Freca, An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis, Procedia Computer Science 141 (2018) 466–471. https://doi.org/10.1016/j.procs.2018.10.141
N. D. Cilia, G. De Gregorio, C. De Stefano, F. Fontanella, A. Marcelli, A. Parziale, Diagnosing Alzheimer’s disease from online handwriting: A novel dataset and performance benchmarking, Engineering Applications of Artificial Intelligence, Vol. 111 (2022) 104822. https://doi.org/10.1016/j.engappai.2022.104822
Protocol: The researchers developed a protocol consisting of 25 handwriting tasks designed to assess different aspects of cognitive and motor function potentially affected by AD. These tasks fall into three categories: Graphic, Copy, and Memory.
Data Acquisition: They collected data from 174 participants (89 AD patients and 85 healthy controls) using a Wacom Bamboo tablet, recording pen tip movements and pressure.
Feature Extraction: From the raw data, they extracted 18 features per task, encompassing measures of time, speed, acceleration, jerk, pressure, and spatial characteristics.
P: Stands for "Patients", referring to individuals diagnosed with Alzheimer's Disease.
H: Stands for "Healthy", referring to individuals who are not diagnosed with Alzheimer's Disease and serve as a control group.
1 Signature drawing M
2 Join two points with a horizontal line, continuously for four times G
3 Join two points with a vertical line, continuously for four times G
4 Retrace a circle (6 cm of diameter) continuously for four times G
5 Retrace a circle (3 cm of diameter) continuously for four times G
6 Copy the letters ‘l’, ‘m’ and ‘p’ C
7 Copy the letters on the adjacent rows C
8 Write cursively a sequence of four lowercase letters ‘l’, in a single smooth movement C
9 Write cursively a sequence of four lowercase cursive bigrams ‘le’, in a single smooth movement C
10 Copy the word “foglio” C
11 Copy the word “foglio” above a line C
12 Copy the word “mamma” C
13 Copy the word “mamma” above a line C
14 Memorize the words “telefono”, “cane”, and “negozio” and rewrite them M
15 Copy in reverse the word “bottiglia” C
16 Copy in reverse the word “casa” C
17 Copy six words (regular, non regular, non words) in the appropriate boxes C
18 Write the name of the object shown in a picture (a chair) M
19 Copy the fields of a postal order C
20 Write a simple sentence under dictation M
21 Retrace a complex form G
22 Copy a telephone number C
23 Write a telephone number under dictation M
24 Draw a clock, with all hours and put hands at 11:05 (Clock Drawing Test) G
25 Copy a paragraph C
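To make the task list easier to use in code, the category letters can be transcribed into a lookup table. This is a minimal sketch: the name `TASK_CATEGORY` is hypothetical, and reading M, G, and C as Memory, Graphic, and Copy is an assumption based on the three categories named in the protocol description.

```python
from collections import Counter

# Category letter per task number, transcribed from the task list above.
# Assumed meaning: M = Memory, G = Graphic, C = Copy.
TASK_CATEGORY = {
    1: "M", 2: "G", 3: "G", 4: "G", 5: "G",
    6: "C", 7: "C", 8: "C", 9: "C", 10: "C",
    11: "C", 12: "C", 13: "C", 14: "M", 15: "C",
    16: "C", 17: "C", 18: "M", 19: "C", 20: "M",
    21: "G", 22: "C", 23: "M", 24: "G", 25: "C",
}

# Count how many tasks fall into each category.
category_counts = Counter(TASK_CATEGORY.values())
print(category_counts)
```

A table like this could later be used to group the per-task feature columns by category.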
Time Features:
Total Time (TT): Overall task duration.
Air Time (AT): Time spent with the pen in the air.
Paper Time (PT): Time spent writing on the paper.
Speed Features:
Mean Speed on-paper (MSP): Average speed of writing on paper.
Mean Speed in-air (MSA): Average speed of pen movement in the air.
Movement Smoothness Features:
Mean Acceleration on-paper (MAP): Average acceleration of writing on paper.
Mean Acceleration in-air (MAA): Average acceleration of pen movement in the air.
Mean Jerk on-paper (MJP): Average jerk (change in acceleration) of writing on paper.
Mean Jerk in-air (MJA): Average jerk of pen movement in the air.
Pressure Features:
Pressure Mean (PM): Average pressure exerted by the pen on the paper.
Pressure Var (PV): Variance (fluctuation) of the pressure exerted by the pen.
Global Mean Relative Tremor (GMRT) Features:
GMRT on-paper (GMRTP): Measure of tremor during writing on paper.
GMRT in-air (GMRTA): Measure of tremor during in-air movements.
Mean GMRT (GMRT): Average of GMRTP and GMRTA.
Other Features:
Pendowns Number (PWN): Number of times the pen touches the paper.
Max X Extension (XE): Maximum horizontal distance covered by writing.
Max Y Extension (YE): Maximum vertical distance covered by writing.
Dispersion Index (DI): Measure of how much of the paper is used for writing.
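The feature abbreviations above do not appear in the CSV header, which uses longer names such as `air_time1` and `total_time25`. The sketch below pairs each abbreviation with a column stem and rebuilds the expected column names. The pairing is an assumption made by matching the descriptions to the column names shown in the `df.head()` output, and `FEATURE_STEMS` and `N_TASKS` are hypothetical names.

```python
# Abbreviation -> column stem (assumed pairing, inferred from the CSV header).
FEATURE_STEMS = {
    "TT": "total_time",
    "AT": "air_time",
    "PT": "paper_time",
    "MSP": "mean_speed_on_paper",
    "MSA": "mean_speed_in_air",
    "MAP": "mean_acc_on_paper",
    "MAA": "mean_acc_in_air",
    "MJP": "mean_jerk_on_paper",
    "MJA": "mean_jerk_in_air",
    "PM": "pressure_mean",
    "PV": "pressure_var",
    "GMRTP": "gmrt_on_paper",
    "GMRTA": "gmrt_in_air",
    "GMRT": "mean_gmrt",
    "PWN": "num_of_pendown",
    "XE": "max_x_extension",
    "YE": "max_y_extension",
    "DI": "disp_index",
}
N_TASKS = 25

# Column names are the stem followed by the task number; within a task the
# columns appear to be sorted alphabetically in the file.
expected_columns = [
    f"{stem}{task}"
    for task in range(1, N_TASKS + 1)
    for stem in sorted(FEATURE_STEMS.values())
]
print(len(expected_columns))  # 18 stems x 25 tasks
```

Comparing `expected_columns` with the actual header is a quick sanity check that no feature column is missing or misnamed.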
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import chi2
file_path = 'data.csv'
df = pd.read_csv(file_path)
df = df.drop(columns=['ID'])  # drop the ID column before analysis
df.head()
| | air_time1 | disp_index1 | gmrt_in_air1 | gmrt_on_paper1 | max_x_extension1 | max_y_extension1 | mean_acc_in_air1 | mean_acc_on_paper1 | mean_gmrt1 | mean_jerk_in_air1 | ... | mean_jerk_in_air25 | mean_jerk_on_paper25 | mean_speed_in_air25 | mean_speed_on_paper25 | num_of_pendown25 | paper_time25 | pressure_mean25 | pressure_var25 | total_time25 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5160 | 0.000013 | 120.804174 | 86.853334 | 957 | 6601 | 0.361800 | 0.217459 | 103.828754 | 0.051836 | ... | 0.141434 | 0.024471 | 5.596487 | 3.184589 | 71 | 40120 | 1749.278166 | 296102.7676 | 144605 | P |
| 1 | 51980 | 0.000016 | 115.318238 | 83.448681 | 1694 | 6998 | 0.272513 | 0.144880 | 99.383459 | 0.039827 | ... | 0.049663 | 0.018368 | 1.665973 | 0.950249 | 129 | 126700 | 1504.768272 | 278744.2850 | 298640 | P |
| 2 | 2600 | 0.000010 | 229.933997 | 172.761858 | 2333 | 5802 | 0.387020 | 0.181342 | 201.347928 | 0.064220 | ... | 0.178194 | 0.017174 | 4.000781 | 2.392521 | 74 | 45480 | 1431.443492 | 144411.7055 | 79025 | P |
| 3 | 2130 | 0.000010 | 369.403342 | 183.193104 | 1756 | 8159 | 0.556879 | 0.164502 | 276.298223 | 0.090408 | ... | 0.113905 | 0.019860 | 4.206746 | 1.613522 | 123 | 67945 | 1465.843329 | 230184.7154 | 181220 | P |
| 4 | 2310 | 0.000007 | 257.997131 | 111.275889 | 987 | 4732 | 0.266077 | 0.145104 | 184.636510 | 0.037528 | ... | 0.121782 | 0.020872 | 3.319036 | 1.680629 | 92 | 37285 | 1841.702561 | 158290.0255 | 72575 | P |
5 rows × 451 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Columns: 451 entries, air_time1 to class
dtypes: float64(300), int64(150), object(1)
memory usage: 613.2+ KB
df.describe()
| | air_time1 | disp_index1 | gmrt_in_air1 | gmrt_on_paper1 | max_x_extension1 | max_y_extension1 | mean_acc_in_air1 | mean_acc_on_paper1 | mean_gmrt1 | mean_jerk_in_air1 | ... | mean_gmrt25 | mean_jerk_in_air25 | mean_jerk_on_paper25 | mean_speed_in_air25 | mean_speed_on_paper25 | num_of_pendown25 | paper_time25 | pressure_mean25 | pressure_var25 | total_time25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | ... | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 174.000000 | 1.740000e+02 |
| mean | 5664.166667 | 0.000010 | 297.666685 | 200.504413 | 1977.965517 | 7323.896552 | 0.416374 | 0.179823 | 249.085549 | 0.067556 | ... | 221.360646 | 0.148286 | 0.019934 | 4.472643 | 2.871613 | 85.839080 | 43109.712644 | 1629.585962 | 163061.767360 | 1.642033e+05 |
| std | 12653.772746 | 0.000003 | 183.943181 | 111.629546 | 1648.306365 | 2188.290512 | 0.381837 | 0.064693 | 132.698462 | 0.074776 | ... | 63.762013 | 0.062207 | 0.002388 | 1.501411 | 0.852809 | 27.485518 | 19092.024337 | 324.142316 | 56845.610814 | 4.969397e+05 |
| min | 65.000000 | 0.000002 | 28.734515 | 29.935835 | 754.000000 | 561.000000 | 0.067748 | 0.096631 | 41.199445 | 0.011861 | ... | 69.928033 | 0.030169 | 0.014987 | 1.323565 | 0.950249 | 32.000000 | 15930.000000 | 474.049462 | 26984.926660 | 2.998000e+04 |
| 25% | 1697.500000 | 0.000008 | 174.153023 | 136.524742 | 1362.500000 | 6124.000000 | 0.218209 | 0.146647 | 161.136182 | 0.029523 | ... | 178.798382 | 0.107732 | 0.018301 | 3.485934 | 2.401199 | 66.000000 | 32803.750000 | 1499.112088 | 120099.046800 | 5.917500e+04 |
| 50% | 2890.000000 | 0.000009 | 255.791452 | 176.494494 | 1681.000000 | 6975.500000 | 0.275184 | 0.163659 | 224.445268 | 0.039233 | ... | 217.431621 | 0.140483 | 0.019488 | 4.510578 | 2.830672 | 81.000000 | 37312.500000 | 1729.385010 | 158236.771800 | 7.611500e+04 |
| 75% | 4931.250000 | 0.000011 | 358.917885 | 234.052560 | 2082.750000 | 8298.500000 | 0.442706 | 0.188879 | 294.392298 | 0.071057 | ... | 264.310776 | 0.199168 | 0.021134 | 5.212794 | 3.335828 | 101.500000 | 46533.750000 | 1865.626974 | 200921.078475 | 1.275425e+05 |
| max | 109965.000000 | 0.000028 | 1168.328276 | 865.210522 | 18602.000000 | 15783.000000 | 2.772566 | 0.627350 | 836.784702 | 0.543199 | ... | 437.373267 | 0.375078 | 0.029227 | 10.416715 | 5.602909 | 209.000000 | 139575.000000 | 1999.775983 | 352981.850000 | 5.704200e+06 |
8 rows × 450 columns
int_columns = df.select_dtypes(include='int').columns
float_columns = df.select_dtypes(include='float').columns
object_columns = df.select_dtypes(include='object').columns
print("Integer columns:", int_columns)
print("Float columns:", float_columns)
print("Object columns:", object_columns)
Integer columns: Index(['air_time1', 'max_x_extension1', 'max_y_extension1', 'num_of_pendown1',
'paper_time1', 'total_time1', 'air_time2', 'max_x_extension2',
'max_y_extension2', 'num_of_pendown2',
...
'max_y_extension24', 'num_of_pendown24', 'paper_time24', 'total_time24',
'air_time25', 'max_x_extension25', 'max_y_extension25',
'num_of_pendown25', 'paper_time25', 'total_time25'],
dtype='object', length=150)
Float columns: Index(['disp_index1', 'gmrt_in_air1', 'gmrt_on_paper1', 'mean_acc_in_air1',
'mean_acc_on_paper1', 'mean_gmrt1', 'mean_jerk_in_air1',
'mean_jerk_on_paper1', 'mean_speed_in_air1', 'mean_speed_on_paper1',
...
'gmrt_on_paper25', 'mean_acc_in_air25', 'mean_acc_on_paper25',
'mean_gmrt25', 'mean_jerk_in_air25', 'mean_jerk_on_paper25',
'mean_speed_in_air25', 'mean_speed_on_paper25', 'pressure_mean25',
'pressure_var25'],
dtype='object', length=300)
Object columns: Index(['class'], dtype='object')
#categorical_columns = list(X.select_dtypes(include=['object', 'category']).columns)
#numerical_columns = [col for col in X.columns if col not in categorical_columns]
#print("Total columns: {}. Categorical columns {}. Numerical columns {}".format(len(X.columns), len(categorical_columns), len(numerical_columns)))
target_counts = df['class'].value_counts()
print(target_counts)
# Plot the bar chart
plt.figure(figsize=(10, 6))
target_counts.plot(kind='bar')
plt.xlabel('Alzheimers patient Class')
plt.ylabel('Count')
plt.title('Distribution of Alzheimers patient Class')
plt.xticks(rotation=45)
plt.show()
class
P    89
H    85
Name: count, dtype: int64
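The counts above indicate that the two classes are nearly balanced. As a minimal sketch, the proportions and a numeric encoding of the label can be computed as follows. The series here is synthetic, built only to reproduce the counts shown, and the 1/0 encoding is a hypothetical choice rather than something the dataset prescribes.

```python
import pandas as pd

# Synthetic labels with the same counts as the value_counts() output above.
labels = pd.Series(["P"] * 89 + ["H"] * 85, name="class")

# Share of each class, rounded to three decimals.
proportions = labels.value_counts(normalize=True).round(3)

# Hypothetical encoding: 1 = patient (P), 0 = healthy control (H).
encoded = labels.map({"P": 1, "H": 0})

print(proportions)
print(encoded.sum(), len(encoded))
```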
Each participant performed 25 tasks, and the same 18 features were extracted for every task, giving 450 feature columns plus the class label. For illustration, the EDA below focuses on the features of task 1.
# Subset 18 features of task 1
num_subset = df.iloc[:,0:18]
# Subset 18 features of task 1 and task 2
num_subset2 = df.iloc[:,0:36]
# Subset 18 features of task 1 , task 2 and task 3
num_subset3 = df.iloc[:,0:54]
num_subset.head()
| | air_time1 | disp_index1 | gmrt_in_air1 | gmrt_on_paper1 | max_x_extension1 | max_y_extension1 | mean_acc_in_air1 | mean_acc_on_paper1 | mean_gmrt1 | mean_jerk_in_air1 | mean_jerk_on_paper1 | mean_speed_in_air1 | mean_speed_on_paper1 | num_of_pendown1 | paper_time1 | pressure_mean1 | pressure_var1 | total_time1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5160 | 0.000013 | 120.804174 | 86.853334 | 957 | 6601 | 0.361800 | 0.217459 | 103.828754 | 0.051836 | 0.021547 | 1.828076 | 1.493242 | 22 | 10730 | 1679.232060 | 288285.0449 | 15890 |
| 1 | 51980 | 0.000016 | 115.318238 | 83.448681 | 1694 | 6998 | 0.272513 | 0.144880 | 99.383459 | 0.039827 | 0.016885 | 1.817744 | 1.517763 | 11 | 12460 | 1723.171348 | 210516.6356 | 64440 |
| 2 | 2600 | 0.000010 | 229.933997 | 172.761858 | 2333 | 5802 | 0.387020 | 0.181342 | 201.347928 | 0.064220 | 0.020126 | 3.378343 | 3.308866 | 10 | 6080 | 1520.253289 | 120845.8717 | 8680 |
| 3 | 2130 | 0.000010 | 369.403342 | 183.193104 | 1756 | 8159 | 0.556879 | 0.164502 | 276.298223 | 0.090408 | 0.021150 | 5.082499 | 3.542645 | 10 | 5595 | 1913.995532 | 100286.6032 | 7725 |
| 4 | 2310 | 0.000007 | 257.997131 | 111.275889 | 987 | 4732 | 0.266077 | 0.145104 | 184.636510 | 0.037528 | 0.018590 | 3.804656 | 2.180544 | 8 | 4080 | 1819.121324 | 160061.8198 | 6390 |
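The positional slices used above rely on the columns being stored in task order. A name-based selection is one possible alternative that does not depend on column position. The sketch below assumes the `<feature><task number>` naming seen in the header. `task_columns` is a hypothetical helper, and the small frame is synthetic.

```python
import re

import pandas as pd


def task_columns(frame, task):
    """Return the columns whose name ends with exactly the given task number."""
    # The stem may contain only letters and underscores, so task 1 does not
    # accidentally match columns of task 11 or task 21.
    pattern = re.compile(rf"^[a-z_]+{task}$")
    return [col for col in frame.columns if pattern.match(col)]


# Synthetic frame with a few columns in the dataset's naming style.
demo = pd.DataFrame(
    {
        "air_time1": [5160, 51980],
        "total_time1": [15890, 64440],
        "air_time11": [100, 200],
        "total_time11": [300, 400],
        "class": ["P", "P"],
    }
)

task1 = demo[task_columns(demo, 1)]
print(task1.columns.tolist())
```

On the full dataset, `df[task_columns(df, 1)]` would be expected to return the same 18 columns as `df.iloc[:, 0:18]`, provided the naming assumption holds.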
num_cols = num_subset.columns
# Plot histograms
plt.figure(figsize=(20, 20))
for i, col in enumerate(num_cols, 1):
    plt.subplot(6, 4, i)  # Adjust the number of rows and columns as needed
    num_subset[col].dropna().hist(bins=30)  # Drop NaN values for histogram
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
sns.pairplot(df[num_cols.tolist() + ['class']], hue='class')
plt.show()
//anaconda3/lib/python3.11/site-packages/seaborn/_base.py:948: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
data_subset = grouped_data.get_group(pd_key)
plt.figure(figsize=(20, 20))
for i, col in enumerate(num_cols, 1):
    plt.subplot(6, 4, i)  # Adjust the number of rows and columns as needed
    sns.boxplot(x=num_subset[col].dropna())  # One horizontal boxplot per feature
    plt.title(col)
    plt.xlabel(col)
plt.tight_layout()
plt.show()
//anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:632: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas.
positions = grouped.grouper.result_index.to_numpy(dtype=float)
There are a lot of outliers.
correlation_matrix = num_subset.corr()
# Filter correlations larger than |0.7|
high_correlation = correlation_matrix[(correlation_matrix > 0.7) | (correlation_matrix < -0.7)]
# Set diagonal and lower triangle to NaN to avoid duplication
for i in range(len(high_correlation)):
    for j in range(i + 1):
        high_correlation.iat[i, j] = None
# Drop rows and columns with all NaN values
high_correlation = high_correlation.dropna(how='all', axis=0).dropna(how='all', axis=1)
high_correlation
| | mean_acc_on_paper1 | mean_gmrt1 | mean_jerk_in_air1 | mean_jerk_on_paper1 | mean_speed_in_air1 | mean_speed_on_paper1 | paper_time1 | total_time1 |
|---|---|---|---|---|---|---|---|---|
| air_time1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.972902 |
| disp_index1 | NaN | NaN | NaN | NaN | NaN | NaN | 0.808375 | NaN |
| gmrt_in_air1 | NaN | 0.940325 | NaN | NaN | 0.826749 | NaN | NaN | NaN |
| gmrt_on_paper1 | 0.875478 | 0.828012 | NaN | 0.738690 | NaN | 0.985557 | NaN | NaN |
| mean_acc_in_air1 | NaN | NaN | 0.988005 | NaN | NaN | NaN | NaN | NaN |
| mean_acc_on_paper1 | NaN | NaN | NaN | 0.886071 | NaN | 0.896524 | NaN | NaN |
| mean_gmrt1 | NaN | NaN | NaN | NaN | 0.863950 | 0.822041 | NaN | NaN |
| mean_jerk_on_paper1 | NaN | NaN | NaN | NaN | NaN | 0.749382 | NaN | NaN |
| num_of_pendown1 | NaN | NaN | NaN | NaN | NaN | NaN | 0.726058 | NaN |
| paper_time1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.757173 |
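The double loop used above to blank the diagonal and lower triangle can also be written as a boolean mask. A minimal sketch, assuming a numeric DataFrame as input; `high_corr_upper` is a hypothetical helper name that does not appear in the notebook, and the toy data is made up for illustration:

```python
import numpy as np
import pandas as pd


def high_corr_upper(data: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return correlations with |r| > threshold, strict upper triangle only."""
    corr = data.corr()
    # True strictly above the diagonal, False on and below it
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    strong = corr.abs() > threshold
    filtered = corr.where(strong & upper)
    # Drop rows and columns that contain no surviving value
    return filtered.dropna(how="all", axis=0).dropna(how="all", axis=1)


# Toy data (not DARWIN): b is a noisy copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
result = high_corr_upper(toy)
print(result)
```

Because each pair survives only once (row feature before column feature), the result has the same shape as the tables in this section.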
correlation_matrix = num_subset2.corr()
# Filter correlations larger than |0.7|
high_correlation = correlation_matrix[(correlation_matrix > 0.7) | (correlation_matrix < -0.7)]
# Set diagonal and lower triangle to NaN to avoid duplication
for i in range(len(high_correlation)):
    for j in range(i + 1):
        high_correlation.iat[i, j] = None
# Drop rows and columns with all NaN values
high_correlation = high_correlation.dropna(how='all', axis=0).dropna(how='all', axis=1)
high_correlation
| | mean_acc_on_paper1 | mean_gmrt1 | mean_jerk_in_air1 | mean_jerk_on_paper1 | mean_speed_in_air1 | mean_speed_on_paper1 | paper_time1 | total_time1 | mean_gmrt2 | mean_jerk_in_air2 | mean_jerk_on_paper2 | mean_speed_on_paper2 | num_of_pendown2 | paper_time2 | total_time2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| air_time1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.972902 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| disp_index1 | NaN | NaN | NaN | NaN | NaN | NaN | 0.808375 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| gmrt_in_air1 | NaN | 0.940325 | NaN | NaN | 0.826749 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| gmrt_on_paper1 | 0.875478 | 0.828012 | NaN | 0.738690 | NaN | 0.985557 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean_acc_in_air1 | NaN | NaN | 0.988005 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean_acc_on_paper1 | NaN | NaN | NaN | 0.886071 | NaN | 0.896524 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean_gmrt1 | NaN | NaN | NaN | NaN | 0.863950 | 0.822041 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean_jerk_on_paper1 | NaN | NaN | NaN | NaN | NaN | 0.749382 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| num_of_pendown1 | NaN | NaN | NaN | NaN | NaN | NaN | 0.726058 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| paper_time1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.757173 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| air_time2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.878915 | 0.730840 | 0.964098 |
| disp_index2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.767908 | NaN |
| gmrt_in_air2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.932219 | NaN | NaN | NaN | NaN | NaN | NaN |
| gmrt_on_paper2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.896537 | NaN | NaN | NaN |
| mean_acc_in_air2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.994576 | NaN | NaN | NaN | NaN | NaN |
| mean_acc_on_paper2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.818415 | NaN | NaN | NaN | NaN |
| num_of_pendown2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.869366 |
| paper_time2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.885851 |
Some of the 18 features in Task 1 are highly correlated with each other (|r| > 0.7), but none of them is highly correlated with the features of any other task (Tasks 2 to 25); the matrix above shows this for Tasks 1 and 2.
High correlations occur only within a task, not across tasks.
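The statement that strong correlations stay within a task can be checked for all tasks at once rather than two at a time. A sketch, assuming the column layout used in this notebook (18 consecutive feature columns per task); `cross_task_high_corr` is a hypothetical helper, and the toy frame is synthetic data, not DARWIN:

```python
import numpy as np
import pandas as pd


def cross_task_high_corr(features: pd.DataFrame, per_task: int = 18,
                         threshold: float = 0.7) -> pd.DataFrame:
    """List feature pairs from different tasks with |r| > threshold."""
    corr = features.corr()
    n = corr.shape[0]
    # Task id of each column, assuming per_task consecutive columns per task
    task_id = np.arange(n) // per_task
    different_task = task_id[:, None] != task_id[None, :]
    # Strict upper triangle so each pair is reported once
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)
    mask = (corr.abs().to_numpy() > threshold) & different_task & upper
    rows, cols = np.nonzero(mask)
    return pd.DataFrame({
        "Feature1": corr.index[rows],
        "Feature2": corr.columns[cols],
        "Correlation": corr.to_numpy()[rows, cols],
    })


# Toy example with two "tasks" of two features each
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "x1": base,
    "y1": rng.normal(size=300),
    "x2": base + 0.05 * rng.normal(size=300),  # deliberately tied to x1
    "y2": rng.normal(size=300),
})
pairs = cross_task_high_corr(toy, per_task=2)
print(pairs)
```

On the notebook's data the call would presumably be `cross_task_high_corr(df.iloc[:, 0:450])`; an empty result would support the statement above for all 25 tasks.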
# Plot the high correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(high_correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1, linewidths=0.5, linecolor='gray')
plt.title('High Correlation Heatmap (|correlation| > 0.7)')
plt.show()
# Split the 450 feature columns into 25 tasks of 18 consecutive columns each
tasks = {f"Task{k + 1}": df.iloc[:, 18 * k:18 * (k + 1)] for k in range(25)}
# Dictionary to store the high correlation pairs for each task
high_corr_pairs_dict = {}
for task_name, task_data in tasks.items():
    # Calculate the correlation matrix
    correlation_matrix = task_data.corr()
    # Filter correlations larger than |0.7|
    high_correlation = correlation_matrix[(correlation_matrix > 0.7) | (correlation_matrix < -0.7)]
    # Set diagonal and lower triangle to NaN to avoid duplication
    for i in range(len(high_correlation)):
        for j in range(i + 1):
            high_correlation.iat[i, j] = None
    # Drop rows and columns with all NaN values
    high_correlation = high_correlation.dropna(how='all', axis=0).dropna(how='all', axis=1)
    # List the pairs that are highly correlated
    high_corr_pairs = high_correlation.stack().reset_index()
    high_corr_pairs.columns = ['Feature1', 'Feature2', 'Correlation']
    # Store the result in the dictionary
    high_corr_pairs_dict[task_name] = high_corr_pairs
# Print the high correlation pairs for each task
for task_name, high_corr_pairs in high_corr_pairs_dict.items():
    if not high_corr_pairs.empty:
        print(f"Highly Correlated Feature Pairs in {task_name}:")
        print(high_corr_pairs[['Feature1', 'Feature2', 'Correlation']])
        print("\n")
Highly Correlated Feature Pairs in Task1:
Feature1 Feature2 Correlation
0 air_time1 total_time1 0.972902
1 disp_index1 paper_time1 0.808375
2 gmrt_in_air1 mean_gmrt1 0.940325
3 gmrt_in_air1 mean_speed_in_air1 0.826749
4 gmrt_on_paper1 mean_acc_on_paper1 0.875478
5 gmrt_on_paper1 mean_gmrt1 0.828012
6 gmrt_on_paper1 mean_jerk_on_paper1 0.738690
7 gmrt_on_paper1 mean_speed_on_paper1 0.985557
8 mean_acc_in_air1 mean_jerk_in_air1 0.988005
9 mean_acc_on_paper1 mean_jerk_on_paper1 0.886071
10 mean_acc_on_paper1 mean_speed_on_paper1 0.896524
11 mean_gmrt1 mean_speed_in_air1 0.863950
12 mean_gmrt1 mean_speed_on_paper1 0.822041
13 mean_jerk_on_paper1 mean_speed_on_paper1 0.749382
14 num_of_pendown1 paper_time1 0.726058
15 paper_time1 total_time1 0.757173
Highly Correlated Feature Pairs in Task2:
Feature1 Feature2 Correlation
0 air_time2 num_of_pendown2 0.878915
1 air_time2 paper_time2 0.730840
2 air_time2 total_time2 0.964098
3 disp_index2 paper_time2 0.767908
4 gmrt_in_air2 mean_gmrt2 0.932219
5 gmrt_on_paper2 mean_speed_on_paper2 0.896537
6 mean_acc_in_air2 mean_jerk_in_air2 0.994576
7 mean_acc_on_paper2 mean_jerk_on_paper2 0.818415
8 num_of_pendown2 total_time2 0.869366
9 paper_time2 total_time2 0.885851
Highly Correlated Feature Pairs in Task3:
Feature1 Feature2 Correlation
0 air_time3 total_time3 0.847516
1 disp_index3 paper_time3 0.842708
2 gmrt_in_air3 mean_gmrt3 0.933240
3 gmrt_on_paper3 mean_speed_on_paper3 0.899824
4 mean_acc_in_air3 mean_jerk_in_air3 0.996660
5 mean_acc_on_paper3 mean_jerk_on_paper3 0.802542
6 num_of_pendown3 pressure_var3 0.719333
7 paper_time3 total_time3 0.814712
Highly Correlated Feature Pairs in Task4:
Feature1 Feature2 Correlation
0 air_time4 total_time4 0.936848
1 gmrt_in_air4 mean_gmrt4 0.944949
2 gmrt_in_air4 mean_speed_in_air4 0.776363
3 gmrt_on_paper4 mean_speed_on_paper4 0.988750
4 max_x_extension4 max_y_extension4 0.770221
5 mean_acc_in_air4 mean_jerk_in_air4 0.996134
6 mean_acc_in_air4 mean_speed_in_air4 0.762881
7 mean_acc_on_paper4 mean_jerk_on_paper4 0.869096
8 mean_gmrt4 mean_speed_in_air4 0.788678
9 mean_jerk_in_air4 mean_speed_in_air4 0.746192
Highly Correlated Feature Pairs in Task5:
Feature1 Feature2 Correlation
0 air_time5 total_time5 0.841695
1 disp_index5 max_y_extension5 0.850136
2 gmrt_in_air5 mean_gmrt5 0.881580
3 gmrt_on_paper5 mean_speed_on_paper5 0.990230
4 max_x_extension5 max_y_extension5 0.707088
5 mean_acc_in_air5 mean_jerk_in_air5 0.969452
6 mean_acc_on_paper5 mean_jerk_on_paper5 0.841853
7 paper_time5 total_time5 0.905422
Highly Correlated Feature Pairs in Task6:
Feature1 Feature2 Correlation
0 air_time6 total_time6 0.984177
1 disp_index6 paper_time6 0.804206
2 disp_index6 total_time6 0.718099
3 gmrt_in_air6 mean_acc_in_air6 0.728092
4 gmrt_in_air6 mean_gmrt6 0.859972
5 gmrt_in_air6 mean_jerk_in_air6 0.728523
6 gmrt_in_air6 mean_speed_in_air6 0.948211
7 gmrt_on_paper6 mean_acc_on_paper6 0.705548
8 gmrt_on_paper6 mean_gmrt6 0.777139
9 gmrt_on_paper6 mean_speed_on_paper6 0.972472
10 mean_acc_in_air6 mean_jerk_in_air6 0.999460
11 mean_acc_in_air6 mean_speed_in_air6 0.803366
12 mean_acc_on_paper6 mean_jerk_on_paper6 0.841727
13 mean_acc_on_paper6 mean_speed_on_paper6 0.737166
14 mean_gmrt6 mean_speed_in_air6 0.809440
15 mean_gmrt6 mean_speed_on_paper6 0.774425
16 mean_jerk_in_air6 mean_speed_in_air6 0.800747
17 num_of_pendown6 paper_time6 0.716184
18 paper_time6 total_time6 0.789970
Highly Correlated Feature Pairs in Task7:
Feature1 Feature2 Correlation
0 air_time7 total_time7 0.995395
1 disp_index7 gmrt_on_paper7 -0.710178
2 disp_index7 paper_time7 0.854556
3 gmrt_in_air7 mean_acc_in_air7 0.790206
4 gmrt_in_air7 mean_gmrt7 0.861947
5 gmrt_in_air7 mean_jerk_in_air7 0.788250
6 gmrt_in_air7 mean_speed_in_air7 0.989095
7 gmrt_on_paper7 mean_acc_on_paper7 0.704073
8 gmrt_on_paper7 mean_gmrt7 0.925679
9 gmrt_on_paper7 mean_jerk_on_paper7 0.757407
10 gmrt_on_paper7 mean_speed_on_paper7 0.960689
11 mean_acc_in_air7 mean_jerk_in_air7 0.999657
12 mean_acc_in_air7 mean_speed_in_air7 0.812783
13 mean_acc_on_paper7 mean_jerk_on_paper7 0.894712
14 mean_acc_on_paper7 mean_speed_on_paper7 0.716698
15 mean_gmrt7 mean_speed_in_air7 0.841244
16 mean_gmrt7 mean_speed_on_paper7 0.885753
17 mean_jerk_in_air7 mean_speed_in_air7 0.809633
18 mean_jerk_on_paper7 mean_speed_on_paper7 0.763888
Highly Correlated Feature Pairs in Task8:
Feature1 Feature2 Correlation
0 air_time8 num_of_pendown8 0.788848
1 air_time8 total_time8 0.930366
2 disp_index8 paper_time8 0.837786
3 disp_index8 total_time8 0.703056
4 gmrt_in_air8 mean_gmrt8 0.968389
5 gmrt_on_paper8 mean_speed_on_paper8 0.996254
6 mean_acc_in_air8 mean_jerk_in_air8 0.996209
7 mean_acc_on_paper8 mean_jerk_on_paper8 0.817212
8 num_of_pendown8 total_time8 0.802352
9 paper_time8 total_time8 0.821477
Highly Correlated Feature Pairs in Task9:
Feature1 Feature2 Correlation
0 air_time9 num_of_pendown9 0.808861
1 air_time9 total_time9 0.945315
2 gmrt_in_air9 mean_gmrt9 0.976753
3 gmrt_on_paper9 mean_speed_on_paper9 0.989501
4 mean_acc_in_air9 mean_jerk_in_air9 0.998487
5 mean_acc_on_paper9 mean_jerk_on_paper9 0.792467
6 num_of_pendown9 pressure_var9 0.723058
7 num_of_pendown9 total_time9 0.806161
8 paper_time9 total_time9 0.786542
Highly Correlated Feature Pairs in Task10:
Feature1 Feature2 Correlation
0 air_time10 num_of_pendown10 0.718506
1 air_time10 paper_time10 0.754052
2 air_time10 total_time10 0.977930
3 gmrt_in_air10 mean_gmrt10 0.910764
4 gmrt_in_air10 mean_speed_in_air10 0.804005
5 gmrt_on_paper10 mean_acc_on_paper10 0.734830
6 gmrt_on_paper10 mean_gmrt10 0.815390
7 gmrt_on_paper10 mean_speed_on_paper10 0.986698
8 mean_acc_in_air10 mean_jerk_in_air10 0.993249
9 mean_acc_on_paper10 mean_jerk_on_paper10 0.838245
10 mean_acc_on_paper10 mean_speed_on_paper10 0.745339
11 mean_gmrt10 mean_speed_in_air10 0.736995
12 mean_gmrt10 mean_speed_on_paper10 0.825203
13 num_of_pendown10 paper_time10 0.711796
14 num_of_pendown10 total_time10 0.756727
15 paper_time10 total_time10 0.874640
Highly Correlated Feature Pairs in Task11:
Feature1 Feature2 Correlation
0 air_time11 total_time11 0.998010
1 gmrt_in_air11 mean_gmrt11 0.872288
2 gmrt_in_air11 mean_speed_in_air11 0.817690
3 gmrt_on_paper11 mean_acc_on_paper11 0.739347
4 gmrt_on_paper11 mean_gmrt11 0.843760
5 gmrt_on_paper11 mean_speed_on_paper11 0.981935
6 mean_acc_in_air11 mean_jerk_in_air11 0.996682
7 mean_acc_on_paper11 mean_jerk_on_paper11 0.856793
8 mean_acc_on_paper11 mean_speed_on_paper11 0.755491
9 mean_gmrt11 mean_speed_in_air11 0.726196
10 mean_gmrt11 mean_speed_on_paper11 0.843252
Highly Correlated Feature Pairs in Task12:
Feature1 Feature2 Correlation
0 air_time12 total_time12 0.996805
1 gmrt_in_air12 mean_gmrt12 0.982338
2 gmrt_on_paper12 mean_speed_on_paper12 0.989997
3 mean_acc_in_air12 mean_jerk_in_air12 0.988375
4 mean_acc_on_paper12 mean_jerk_on_paper12 0.857532
Highly Correlated Feature Pairs in Task13:
Feature1 Feature2 Correlation
0 air_time13 total_time13 0.913617
1 gmrt_in_air13 mean_gmrt13 0.963367
2 gmrt_in_air13 mean_speed_in_air13 0.749360
3 gmrt_on_paper13 mean_speed_on_paper13 0.983036
4 mean_acc_in_air13 mean_jerk_in_air13 0.997881
5 mean_acc_in_air13 mean_speed_in_air13 0.777862
6 mean_acc_on_paper13 mean_jerk_on_paper13 0.804902
7 mean_gmrt13 mean_speed_in_air13 0.748023
8 mean_jerk_in_air13 mean_speed_in_air13 0.766715
9 paper_time13 total_time13 0.838600
Highly Correlated Feature Pairs in Task14:
Feature1 Feature2 Correlation
0 air_time14 total_time14 0.999728
1 disp_index14 paper_time14 0.756335
2 gmrt_in_air14 gmrt_on_paper14 0.700201
3 gmrt_in_air14 mean_gmrt14 0.957898
4 gmrt_in_air14 mean_speed_in_air14 0.951554
5 gmrt_in_air14 mean_speed_on_paper14 0.705478
6 gmrt_on_paper14 max_x_extension14 0.726850
7 gmrt_on_paper14 mean_acc_on_paper14 0.714079
8 gmrt_on_paper14 mean_gmrt14 0.875701
9 gmrt_on_paper14 mean_speed_in_air14 0.717726
10 gmrt_on_paper14 mean_speed_on_paper14 0.968376
11 mean_acc_in_air14 mean_jerk_in_air14 0.998935
12 mean_acc_in_air14 mean_speed_in_air14 0.706889
13 mean_acc_on_paper14 mean_jerk_on_paper14 0.890177
14 mean_acc_on_paper14 mean_speed_on_paper14 0.743337
15 mean_gmrt14 mean_speed_in_air14 0.932180
16 mean_gmrt14 mean_speed_on_paper14 0.866553
17 mean_speed_in_air14 mean_speed_on_paper14 0.707248
Highly Correlated Feature Pairs in Task15:
Feature1 Feature2 Correlation
0 air_time15 total_time15 0.995795
1 disp_index15 paper_time15 0.739648
2 gmrt_in_air15 mean_gmrt15 0.812303
3 gmrt_in_air15 mean_speed_in_air15 0.952605
4 gmrt_on_paper15 mean_acc_on_paper15 0.727283
5 gmrt_on_paper15 mean_gmrt15 0.839909
6 gmrt_on_paper15 mean_jerk_on_paper15 0.761560
7 gmrt_on_paper15 mean_speed_on_paper15 0.955322
8 mean_acc_in_air15 mean_jerk_in_air15 0.998358
9 mean_acc_in_air15 mean_speed_in_air15 0.733157
10 mean_acc_on_paper15 mean_jerk_on_paper15 0.887051
11 mean_acc_on_paper15 mean_speed_on_paper15 0.752111
12 mean_gmrt15 mean_speed_in_air15 0.764917
13 mean_gmrt15 mean_speed_on_paper15 0.802560
14 mean_jerk_in_air15 mean_speed_in_air15 0.720893
15 mean_jerk_on_paper15 mean_speed_on_paper15 0.769756
Highly Correlated Feature Pairs in Task16:
Feature1 Feature2 Correlation
0 air_time16 total_time16 0.974063
1 disp_index16 max_x_extension16 0.919359
2 disp_index16 max_y_extension16 0.844919
3 disp_index16 num_of_pendown16 0.791532
4 disp_index16 paper_time16 0.803024
5 gmrt_in_air16 mean_gmrt16 0.903230
6 gmrt_in_air16 mean_speed_in_air16 0.824760
7 gmrt_on_paper16 mean_speed_on_paper16 0.918512
8 max_x_extension16 max_y_extension16 0.827823
9 max_x_extension16 num_of_pendown16 0.765683
10 max_y_extension16 num_of_pendown16 0.751162
11 mean_acc_in_air16 mean_jerk_in_air16 0.998580
12 mean_acc_in_air16 mean_speed_in_air16 0.723327
13 mean_acc_on_paper16 mean_jerk_on_paper16 0.853240
14 mean_gmrt16 mean_speed_in_air16 0.750478
15 mean_jerk_in_air16 mean_speed_in_air16 0.721484
16 num_of_pendown16 paper_time16 0.728595
17 paper_time16 total_time16 0.750774
Highly Correlated Feature Pairs in Task17:
Feature1 Feature2 Correlation
0 air_time17 total_time17 0.991785
1 gmrt_in_air17 mean_acc_in_air17 0.920234
2 gmrt_in_air17 mean_gmrt17 0.961986
3 gmrt_in_air17 mean_jerk_in_air17 0.918957
4 gmrt_in_air17 mean_speed_in_air17 0.990772
5 gmrt_on_paper17 mean_gmrt17 0.802489
6 gmrt_on_paper17 mean_speed_on_paper17 0.949224
7 max_x_extension17 max_y_extension17 0.836344
8 mean_acc_in_air17 mean_gmrt17 0.881192
9 mean_acc_in_air17 mean_jerk_in_air17 0.999833
10 mean_acc_in_air17 mean_speed_in_air17 0.924006
11 mean_acc_on_paper17 mean_jerk_on_paper17 0.890039
12 mean_gmrt17 mean_jerk_in_air17 0.880017
13 mean_gmrt17 mean_speed_in_air17 0.965667
14 mean_gmrt17 mean_speed_on_paper17 0.772327
15 mean_jerk_in_air17 mean_speed_in_air17 0.922295
Highly Correlated Feature Pairs in Task18:
Feature1 Feature2 Correlation
0 air_time18 disp_index18 0.808225
1 air_time18 num_of_pendown18 0.924399
2 air_time18 paper_time18 0.848103
3 air_time18 total_time18 0.985891
4 disp_index18 max_x_extension18 0.811118
5 disp_index18 num_of_pendown18 0.894879
6 disp_index18 paper_time18 0.877213
7 disp_index18 total_time18 0.857403
8 gmrt_in_air18 mean_gmrt18 0.938460
9 gmrt_in_air18 mean_speed_in_air18 0.925410
10 gmrt_on_paper18 mean_acc_on_paper18 0.727088
11 gmrt_on_paper18 mean_gmrt18 0.716782
12 gmrt_on_paper18 mean_speed_on_paper18 0.960014
13 max_x_extension18 max_y_extension18 0.769647
14 max_x_extension18 num_of_pendown18 0.724847
15 mean_acc_in_air18 mean_jerk_in_air18 0.996263
16 mean_acc_on_paper18 mean_jerk_on_paper18 0.842544
17 mean_acc_on_paper18 mean_speed_on_paper18 0.755078
18 mean_gmrt18 mean_speed_in_air18 0.878387
19 mean_gmrt18 mean_speed_on_paper18 0.751618
20 num_of_pendown18 paper_time18 0.886797
21 num_of_pendown18 total_time18 0.943838
22 paper_time18 total_time18 0.924825
Highly Correlated Feature Pairs in Task19:
Feature1 Feature2 Correlation
0 air_time19 total_time19 0.999992
1 gmrt_in_air19 mean_acc_in_air19 0.741345
2 gmrt_in_air19 mean_gmrt19 0.875684
3 gmrt_in_air19 mean_jerk_in_air19 0.731213
4 gmrt_in_air19 mean_speed_in_air19 0.986371
5 gmrt_on_paper19 mean_gmrt19 0.852113
6 gmrt_on_paper19 mean_speed_on_paper19 0.963996
7 mean_acc_in_air19 mean_jerk_in_air19 0.998665
8 mean_acc_in_air19 mean_speed_in_air19 0.729755
9 mean_acc_on_paper19 mean_jerk_on_paper19 0.854577
10 mean_gmrt19 mean_speed_in_air19 0.841253
11 mean_gmrt19 mean_speed_on_paper19 0.826229
12 mean_jerk_in_air19 mean_speed_in_air19 0.716694
Highly Correlated Feature Pairs in Task20:
Feature1 Feature2 Correlation
0 air_time20 total_time20 0.978060
1 disp_index20 paper_time20 0.775212
2 gmrt_in_air20 mean_gmrt20 0.923186
3 gmrt_in_air20 mean_speed_in_air20 0.901774
4 gmrt_on_paper20 mean_gmrt20 0.753344
5 gmrt_on_paper20 mean_speed_on_paper20 0.973751
6 mean_acc_in_air20 mean_jerk_in_air20 0.996749
7 mean_acc_on_paper20 mean_jerk_on_paper20 0.873658
8 mean_gmrt20 mean_speed_in_air20 0.882880
9 mean_gmrt20 mean_speed_on_paper20 0.775548
Highly Correlated Feature Pairs in Task21:
Feature1 Feature2 Correlation
0 air_time21 num_of_pendown21 0.815425
1 air_time21 total_time21 0.844409
2 disp_index21 paper_time21 0.791304
3 gmrt_in_air21 mean_gmrt21 0.988968
4 gmrt_on_paper21 mean_acc_on_paper21 -0.725525
5 gmrt_on_paper21 mean_speed_on_paper21 0.982040
6 max_x_extension21 max_y_extension21 0.861927
7 mean_acc_in_air21 mean_jerk_in_air21 0.999764
8 mean_acc_in_air21 mean_speed_in_air21 0.936375
9 mean_acc_on_paper21 mean_jerk_on_paper21 0.905703
10 mean_acc_on_paper21 mean_speed_on_paper21 -0.736360
11 mean_jerk_in_air21 mean_speed_in_air21 0.934035
12 num_of_pendown21 paper_time21 0.703650
13 num_of_pendown21 total_time21 0.829889
14 paper_time21 total_time21 0.937578
Highly Correlated Feature Pairs in Task22:
Feature1 Feature2 Correlation
0 air_time22 total_time22 0.999412
1 gmrt_in_air22 mean_gmrt22 0.903649
2 gmrt_in_air22 mean_speed_in_air22 0.961481
3 gmrt_on_paper22 mean_gmrt22 0.902864
4 gmrt_on_paper22 mean_speed_on_paper22 0.978226
5 mean_acc_in_air22 mean_jerk_in_air22 0.992637
6 mean_acc_on_paper22 mean_jerk_on_paper22 0.869012
7 mean_gmrt22 mean_speed_in_air22 0.880084
8 mean_gmrt22 mean_speed_on_paper22 0.884698
Highly Correlated Feature Pairs in Task23:
Feature1 Feature2 Correlation
0 air_time23 total_time23 0.988343
1 gmrt_in_air23 mean_gmrt23 0.874589
2 gmrt_in_air23 mean_speed_in_air23 0.974215
3 gmrt_on_paper23 mean_gmrt23 0.919647
4 gmrt_on_paper23 mean_speed_on_paper23 0.971879
5 mean_acc_in_air23 mean_jerk_in_air23 0.991724
6 mean_acc_on_paper23 mean_jerk_on_paper23 0.850627
7 mean_acc_on_paper23 mean_speed_on_paper23 0.719526
8 mean_gmrt23 mean_speed_in_air23 0.831196
9 mean_gmrt23 mean_speed_on_paper23 0.907974
Highly Correlated Feature Pairs in Task24:
Feature1 Feature2 Correlation
0 air_time24 total_time24 0.987239
1 gmrt_in_air24 gmrt_on_paper24 0.775761
2 gmrt_in_air24 mean_acc_in_air24 0.807689
3 gmrt_in_air24 mean_gmrt24 0.930178
4 gmrt_in_air24 mean_jerk_in_air24 0.802715
5 gmrt_in_air24 mean_speed_in_air24 0.983763
6 gmrt_on_paper24 mean_gmrt24 0.953251
7 gmrt_on_paper24 mean_speed_in_air24 0.731777
8 gmrt_on_paper24 mean_speed_on_paper24 0.923057
9 max_x_extension24 max_y_extension24 0.932840
10 mean_acc_in_air24 mean_gmrt24 0.727239
11 mean_acc_in_air24 mean_jerk_in_air24 0.999320
12 mean_acc_in_air24 mean_speed_in_air24 0.820703
13 mean_acc_on_paper24 mean_jerk_on_paper24 0.893046
14 mean_gmrt24 mean_jerk_in_air24 0.724329
15 mean_gmrt24 mean_speed_in_air24 0.896815
16 mean_gmrt24 mean_speed_on_paper24 0.871719
17 mean_jerk_in_air24 mean_speed_in_air24 0.813987
Highly Correlated Feature Pairs in Task25:
Feature1 Feature2 Correlation
0 air_time25 total_time25 0.999294
1 disp_index25 paper_time25 0.703494
2 gmrt_in_air25 mean_acc_in_air25 0.872965
3 gmrt_in_air25 mean_gmrt25 0.949397
4 gmrt_in_air25 mean_jerk_in_air25 0.871563
5 gmrt_in_air25 mean_speed_in_air25 0.984926
6 gmrt_in_air25 mean_speed_on_paper25 0.708510
7 gmrt_on_paper25 mean_gmrt25 0.879144
8 gmrt_on_paper25 mean_speed_on_paper25 0.964624
9 mean_acc_in_air25 mean_gmrt25 0.839432
10 mean_acc_in_air25 mean_jerk_in_air25 0.999427
11 mean_acc_in_air25 mean_speed_in_air25 0.869102
12 mean_acc_on_paper25 mean_jerk_on_paper25 0.891626
13 mean_gmrt25 mean_jerk_in_air25 0.838844
14 mean_gmrt25 mean_speed_in_air25 0.943679
15 mean_gmrt25 mean_speed_on_paper25 0.879285
16 mean_jerk_in_air25 mean_speed_in_air25 0.864771
17 mean_speed_in_air25 mean_speed_on_paper25 0.718706
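To see which pairs recur across tasks, the per-task tables can be pooled after stripping the task number from each feature name. A sketch, assuming a dictionary shaped like `high_corr_pairs_dict` (task name mapped to a frame with `Feature1`, `Feature2` and `Correlation` columns); `count_recurring_pairs` is a hypothetical helper, and the sample holds only a few rows copied from the Task12 and Task22 output above:

```python
import pandas as pd


def count_recurring_pairs(pairs_by_task: dict) -> pd.Series:
    """Count in how many tasks each (Feature1, Feature2) pair appears."""
    pooled = pd.concat(list(pairs_by_task.values()), ignore_index=True)
    # Remove the trailing task number, e.g. "air_time12" -> "air_time"
    f1 = pooled["Feature1"].str.replace(r"\d+$", "", regex=True)
    f2 = pooled["Feature2"].str.replace(r"\d+$", "", regex=True)
    return (f1 + " & " + f2).value_counts()


# A few rows taken from the printed output, for illustration only
sample = {
    "Task12": pd.DataFrame({
        "Feature1": ["air_time12", "gmrt_in_air12"],
        "Feature2": ["total_time12", "mean_gmrt12"],
        "Correlation": [0.996805, 0.982338],
    }),
    "Task22": pd.DataFrame({
        "Feature1": ["air_time22", "gmrt_in_air22", "gmrt_in_air22"],
        "Feature2": ["total_time22", "mean_gmrt22", "mean_speed_in_air22"],
        "Correlation": [0.999412, 0.903649, 0.961481],
    }),
}
counts = count_recurring_pairs(sample)
print(counts)
```

Applied to the full `high_corr_pairs_dict`, a count of 25 would mark a pair that exceeds the threshold in every task.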
1. air_time & total_time:
Both are durations. air_time is the time the pen spends moving in the air, and total_time is the time taken for the whole task (in-air plus on-paper time), so air_time is a component of total_time.
2. disp_index & paper_time:
disp_index is the dispersion index, which describes how widely the handwritten trace is spread over the sheet, and paper_time is the time the pen spends on the paper. The positive correlation indicates that more dispersed traces go with longer on-paper time.
3. gmrt_in_air & mean_gmrt:
GMRT stands for Generalization of the Mean Relative Tremor, a tremor measure. Per the DARWIN feature definitions, mean_gmrt is the average of gmrt_in_air and gmrt_on_paper, so it is correlated with both by construction.
4. gmrt_on_paper & mean_speed_on_paper:
Both are computed from the on-paper pen movement: gmrt_on_paper measures tremor and mean_speed_on_paper measures average speed while the pen touches the paper.
5. mean_acc_in_air & mean_jerk_in_air:
mean_acc_in_air and mean_jerk_in_air are the mean acceleration and mean jerk of the in-air movement. Jerk is the rate of change of acceleration, so both are derived from the same trajectory.
6. mean_acc_on_paper & mean_jerk_on_paper:
The same acceleration and jerk measures, computed for the on-paper movement.
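Some of these correlations may be exact linear relations rather than merely strong ones. A sketch to test two such relations on a single task, assuming the column naming used above and assuming that total time equals in-air plus on-paper time and that mean GMRT is the average of the two GMRT values; `check_identities` is a hypothetical helper, and the toy frame is synthetic:

```python
import numpy as np
import pandas as pd


def check_identities(task_data: pd.DataFrame, task_no: int, rtol: float = 1e-6) -> dict:
    """Test two assumed linear identities among one task's features."""

    def col(name: str) -> np.ndarray:
        # Column names carry the task number as a suffix, e.g. "air_time1"
        return task_data[f"{name}{task_no}"].to_numpy(dtype=float)

    return {
        # Assumed identity: total_time = air_time + paper_time
        "total_time_is_sum": bool(np.allclose(
            col("total_time"), col("air_time") + col("paper_time"), rtol=rtol)),
        # Assumed identity: mean_gmrt = (gmrt_in_air + gmrt_on_paper) / 2
        "mean_gmrt_is_average": bool(np.allclose(
            col("mean_gmrt"), (col("gmrt_in_air") + col("gmrt_on_paper")) / 2, rtol=rtol)),
    }


# Synthetic frame in which both identities hold by construction
rng = np.random.default_rng(2)
air = rng.uniform(1000, 5000, size=50)
paper = rng.uniform(1000, 5000, size=50)
g_air = rng.uniform(50, 300, size=50)
g_paper = rng.uniform(50, 300, size=50)
toy = pd.DataFrame({
    "air_time1": air,
    "paper_time1": paper,
    "total_time1": air + paper,
    "gmrt_in_air1": g_air,
    "gmrt_on_paper1": g_paper,
    "mean_gmrt1": (g_air + g_paper) / 2,
})
report = check_identities(toy, task_no=1)
print(report)
```

If a check returns True on the real data, one column of each identity is redundant and could be dropped before fitting a linear model.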
vif_results = {}
# Iterate through each task
for task_name, task_data in tasks.items():
    # Add constant to the task data
    X_num = sm.add_constant(task_data)
    # Calculate VIF
    vif = pd.DataFrame()
    vif["Features"] = X_num.columns
    vif["VIF"] = [variance_inflation_factor(X_num.values, i) for i in range(X_num.shape[1])]
    # Store VIF results for the current task
    vif_results[task_name] = vif
# Print or use vif_results as needed
for task_name, vif_result in vif_results.items():
    print(f"VIF for {task_name}:")
    print(vif_result)
    print("\n")
//anaconda3/lib/python3.11/site-packages/statsmodels/stats/outliers_influence.py:198: RuntimeWarning: divide by zero encountered in scalar divide
vif = 1. / (1. - r_squared_i)
VIF for Task1:
Features VIF
0 const 1.105667e+02
1 air_time1 3.753000e+14
2 disp_index1 7.928363e+00
3 gmrt_in_air1 inf
4 gmrt_on_paper1 inf
5 max_x_extension1 1.324342e+00
6 max_y_extension1 3.573801e+00
7 mean_acc_in_air1 8.697292e+01
8 mean_acc_on_paper1 1.549472e+01
9 mean_gmrt1 inf
10 mean_jerk_in_air1 7.490932e+01
11 mean_jerk_on_paper1 6.111167e+00
12 mean_speed_in_air1 8.744612e+00
13 mean_speed_on_paper1 6.005949e+01
14 num_of_pendown1 3.056514e+00
15 paper_time1 1.841963e+13
16 pressure_mean1 1.344270e+00
17 pressure_var1 1.529629e+00
18 total_time1 1.286743e+15
VIF for Task2:
Features VIF
0 const 9.704804e+01
1 air_time2 1.622919e+13
2 disp_index2 4.706975e+00
3 gmrt_in_air2 inf
4 gmrt_on_paper2 inf
5 max_x_extension2 1.725799e+00
6 max_y_extension2 1.393634e+00
7 mean_acc_in_air2 1.100280e+02
8 mean_acc_on_paper2 5.180047e+00
9 mean_gmrt2 inf
10 mean_jerk_in_air2 1.080608e+02
11 mean_jerk_on_paper2 4.420443e+00
12 mean_speed_in_air2 2.963337e+00
13 mean_speed_on_paper2 9.264403e+00
14 num_of_pendown2 6.865570e+00
15 paper_time2 4.816684e+13
16 pressure_mean2 1.433557e+00
17 pressure_var2 2.071652e+00
18 total_time2 2.918730e+12
VIF for Task3:
Features VIF
0 const 6.880673e+01
1 air_time3 9.790434e+13
2 disp_index3 5.629994e+00
3 gmrt_in_air3 inf
4 gmrt_on_paper3 inf
5 max_x_extension3 1.885585e+00
6 max_y_extension3 1.690008e+00
7 mean_acc_in_air3 1.892925e+02
8 mean_acc_on_paper3 5.315598e+00
9 mean_gmrt3 inf
10 mean_jerk_in_air3 1.833855e+02
11 mean_jerk_on_paper3 3.959730e+00
12 mean_speed_in_air3 2.978821e+00
13 mean_speed_on_paper3 1.198944e+01
14 num_of_pendown3 4.341671e+00
15 paper_time3 1.094435e+13
16 pressure_mean3 1.836625e+00
17 pressure_var3 2.275786e+00
18 total_time3 2.001600e+14
VIF for Task4:
Features VIF
0 const 2.949786e+02
1 air_time4 4.094181e+14
2 disp_index4 5.336681e+00
3 gmrt_in_air4 inf
4 gmrt_on_paper4 inf
5 max_x_extension4 3.812092e+00
6 max_y_extension4 3.434112e+00
7 mean_acc_in_air4 1.598819e+02
8 mean_acc_on_paper4 5.585926e+00
9 mean_gmrt4 inf
10 mean_jerk_in_air4 1.477640e+02
11 mean_jerk_on_paper4 5.572271e+00
12 mean_speed_in_air4 5.820308e+00
13 mean_speed_on_paper4 8.843293e+01
14 num_of_pendown4 2.924574e+00
15 paper_time4 3.216857e+14
16 pressure_mean4 1.710513e+00
17 pressure_var4 2.185169e+00
18 total_time4 7.505999e+14
VIF for Task5:
Features VIF
0 const 8.687105e+01
1 air_time5 1.272203e+13
2 disp_index5 7.741576e+00
3 gmrt_in_air5 inf
4 gmrt_on_paper5 inf
5 max_x_extension5 4.658236e+00
6 max_y_extension5 8.930549e+00
7 mean_acc_in_air5 2.831899e+01
8 mean_acc_on_paper5 5.299003e+00
9 mean_gmrt5 inf
10 mean_jerk_in_air5 2.399312e+01
11 mean_jerk_on_paper5 5.513568e+00
12 mean_speed_in_air5 3.830815e+00
13 mean_speed_on_paper5 1.211562e+02
14 num_of_pendown5 2.841678e+00
15 paper_time5 2.850380e+13
16 pressure_mean5 1.798498e+00
17 pressure_var5 1.807462e+00
18 total_time5 1.983965e+13
VIF for Task6:
Features VIF
0 const 9.853068e+01
1 air_time6 inf
2 disp_index6 4.241558e+00
3 gmrt_in_air6 inf
4 gmrt_on_paper6 inf
5 max_x_extension6 1.348731e+00
6 max_y_extension6 1.645329e+00
7 mean_acc_in_air6 1.155542e+03
8 mean_acc_on_paper6 5.847207e+00
9 mean_gmrt6 inf
10 mean_jerk_in_air6 1.142259e+03
11 mean_jerk_on_paper6 5.108992e+00
12 mean_speed_in_air6 1.890550e+01
13 mean_speed_on_paper6 3.247801e+01
14 num_of_pendown6 3.598965e+00
15 paper_time6 2.573486e+14
16 pressure_mean6 1.255557e+00
17 pressure_var6 1.377617e+00
18 total_time6 3.768703e+13
VIF for Task7:
Features VIF
0 const 8.494999e+02
1 air_time7 6.928615e+14
2 disp_index7 6.974183e+00
3 gmrt_in_air7 inf
4 gmrt_on_paper7 inf
5 max_x_extension7 1.610340e+00
6 max_y_extension7 1.372123e+00
7 mean_acc_in_air7 1.839324e+03
8 mean_acc_on_paper7 6.349595e+00
9 mean_gmrt7 inf
10 mean_jerk_in_air7 1.801752e+03
11 mean_jerk_on_paper7 6.963139e+00
12 mean_speed_in_air7 6.869568e+01
13 mean_speed_on_paper7 2.106457e+01
14 num_of_pendown7 2.280263e+00
15 paper_time7 2.037828e+13
16 pressure_mean7 1.290672e+00
17 pressure_var7 1.295256e+00
18 total_time7 2.537239e+13
VIF for Task8:
Features VIF
0 const 1.245849e+02
1 air_time8 2.370316e+14
2 disp_index8 7.365519e+00
3 gmrt_in_air8 inf
4 gmrt_on_paper8 inf
5 max_x_extension8 2.881130e+00
6 max_y_extension8 3.467209e+00
7 mean_acc_in_air8 1.427206e+02
8 mean_acc_on_paper8 3.798652e+00
9 mean_gmrt8 inf
10 mean_jerk_in_air8 1.405687e+02
11 mean_jerk_on_paper8 4.111470e+00
12 mean_speed_in_air8 3.143658e+00
13 mean_speed_on_paper8 2.348250e+02
14 num_of_pendown8 5.805948e+00
15 paper_time8 2.914951e+13
16 pressure_mean8 1.417820e+00
17 pressure_var8 1.801673e+00
18 total_time8 3.216857e+14
VIF for Task9:
Features VIF
0 const 1.652827e+02
1 air_time9 1.358552e+13
2 disp_index9 6.093110e+00
3 gmrt_in_air9 inf
4 gmrt_on_paper9 inf
5 max_x_extension9 3.770112e+00
6 max_y_extension9 4.416594e+00
7 mean_acc_in_air9 4.141213e+02
8 mean_acc_on_paper9 3.375279e+00
9 mean_gmrt9 inf
10 mean_jerk_in_air9 4.021643e+02
11 mean_jerk_on_paper9 3.905194e+00
12 mean_speed_in_air9 4.085624e+00
13 mean_speed_on_paper9 1.540690e+02
14 num_of_pendown9 6.053427e+00
15 paper_time9 3.216857e+14
16 pressure_mean9 1.914346e+00
17 pressure_var9 2.644888e+00
18 total_time9 4.503600e+14
VIF for Task10:
Features VIF
0 const 1.182665e+02
1 air_time10 8.578285e+13
2 disp_index10 4.446180e+00
3 gmrt_in_air10 inf
4 gmrt_on_paper10 inf
5 max_x_extension10 4.187501e+00
6 max_y_extension10 3.759179e+00
7 mean_acc_in_air10 8.702232e+01
8 mean_acc_on_paper10 9.031832e+00
9 mean_gmrt10 inf
10 mean_jerk_in_air10 8.350087e+01
11 mean_jerk_on_paper10 5.244241e+00
12 mean_speed_in_air10 5.492209e+00
13 mean_speed_on_paper10 8.183551e+01
14 num_of_pendown10 4.846591e+00
15 paper_time10 3.102721e+12
16 pressure_mean10 1.469359e+00
17 pressure_var10 2.000801e+00
18 total_time10 1.344358e+14
VIF for Task11:
Features VIF
0 const 1.521792e+02
1 air_time11 1.286743e+15
2 disp_index11 3.094205e+00
3 gmrt_in_air11 inf
4 gmrt_on_paper11 inf
5 max_x_extension11 1.799007e+00
6 max_y_extension11 2.801996e+00
7 mean_acc_in_air11 1.813071e+02
8 mean_acc_on_paper11 7.701560e+00
9 mean_gmrt11 inf
10 mean_jerk_in_air11 1.738292e+02
11 mean_jerk_on_paper11 5.456851e+00
12 mean_speed_in_air11 5.067461e+00
13 mean_speed_on_paper11 6.059699e+01
14 num_of_pendown11 2.907528e+00
15 paper_time11 3.721983e+13
16 pressure_mean11 1.358360e+00
17 pressure_var11 1.424200e+00
18 total_time11 4.503600e+15
VIF for Task12:
Features VIF
0 const 8.049717e+01
1 air_time12 1.125900e+15
2 disp_index12 5.447076e+00
3 gmrt_in_air12 inf
4 gmrt_on_paper12 inf
5 max_x_extension12 3.211526e+00
6 max_y_extension12 5.092104e+00
7 mean_acc_in_air12 6.246471e+01
8 mean_acc_on_paper12 5.418308e+00
9 mean_gmrt12 inf
10 mean_jerk_in_air12 5.732546e+01
11 mean_jerk_on_paper12 6.272319e+00
12 mean_speed_in_air12 3.686338e+00
13 mean_speed_on_paper12 9.989373e+01
14 num_of_pendown12 3.089404e+00
15 paper_time12 2.434378e+14
16 pressure_mean12 1.563005e+00
17 pressure_var12 1.832552e+00
18 total_time12 1.668000e+14
VIF for Task13:
Features VIF
0 const 1.584171e+02
1 air_time13 4.228732e+13
2 disp_index13 4.904570e+00
3 gmrt_in_air13 inf
4 gmrt_on_paper13 inf
5 max_x_extension13 3.491735e+00
6 max_y_extension13 4.665869e+00
7 mean_acc_in_air13 3.003368e+02
8 mean_acc_on_paper13 4.274424e+00
9 mean_gmrt13 inf
10 mean_jerk_in_air13 2.879399e+02
11 mean_jerk_on_paper13 4.966804e+00
12 mean_speed_in_air13 6.045759e+00
13 mean_speed_on_paper13 1.167574e+02
14 num_of_pendown13 3.887106e+00
15 paper_time13 1.286743e+14
16 pressure_mean13 1.330342e+00
17 pressure_var13 1.794759e+00
18 total_time13 1.073564e+13
VIF for Task14:
Features VIF
0 const 8.094768e+01
1 air_time14 inf
2 disp_index14 6.825483e+00
3 gmrt_in_air14 inf
4 gmrt_on_paper14 inf
5 max_x_extension14 1.089854e+01
6 max_y_extension14 4.578782e+00
7 mean_acc_in_air14 6.499061e+02
8 mean_acc_on_paper14 1.098977e+01
9 mean_gmrt14 inf
10 mean_jerk_in_air14 6.298104e+02
11 mean_jerk_on_paper14 8.400442e+00
12 mean_speed_in_air14 1.725920e+01
13 mean_speed_on_paper14 4.798802e+01
14 num_of_pendown14 3.901245e+00
15 paper_time14 2.943529e+13
16 pressure_mean14 1.586477e+00
17 pressure_var14 1.672349e+00
18 total_time14 1.501200e+15
VIF for Task15:
Features VIF
0 const 1.323029e+02
1 air_time15 3.464307e+14
2 disp_index15 4.014468e+00
3 gmrt_in_air15 inf
4 gmrt_on_paper15 inf
5 max_x_extension15 2.624357e+00
6 max_y_extension15 3.753577e+00
7 mean_acc_in_air15 4.105828e+02
8 mean_acc_on_paper15 6.274418e+00
9 mean_gmrt15 inf
10 mean_jerk_in_air15 3.979800e+02
11 mean_jerk_on_paper15 7.844200e+00
12 mean_speed_in_air15 1.940283e+01
13 mean_speed_on_paper15 2.212481e+01
14 num_of_pendown15 3.149710e+00
15 paper_time15 7.832347e+13
16 pressure_mean15 1.312124e+00
17 pressure_var15 1.336161e+00
18 total_time15 1.000800e+15
VIF for Task16:
Features VIF
0 const 7.712815e+01
1 air_time16 1.544974e+13
2 disp_index16 1.857458e+01
3 gmrt_in_air16 inf
4 gmrt_on_paper16 inf
5 max_x_extension16 1.386369e+01
6 max_y_extension16 7.890255e+00
7 mean_acc_in_air16 4.199157e+02
8 mean_acc_on_paper16 5.057406e+00
9 mean_gmrt16 inf
10 mean_jerk_in_air16 4.164022e+02
11 mean_jerk_on_paper16 4.521262e+00
12 mean_speed_in_air16 6.617539e+00
13 mean_speed_on_paper16 1.551746e+01
14 num_of_pendown16 5.596677e+00
15 paper_time16 1.407375e+14
16 pressure_mean16 1.369743e+00
17 pressure_var16 1.484131e+00
18 total_time16 3.832851e+13
VIF for Task17:
Features VIF
0 const 2.832389e+02
1 air_time17 3.105931e+14
2 disp_index17 8.301257e+00
3 gmrt_in_air17 inf
4 gmrt_on_paper17 inf
5 max_x_extension17 5.030863e+00
6 max_y_extension17 6.539413e+00
7 mean_acc_in_air17 3.577205e+03
8 mean_acc_on_paper17 7.362602e+00
9 mean_gmrt17 inf
10 mean_jerk_in_air17 3.511993e+03
11 mean_jerk_on_paper17 8.268067e+00
12 mean_speed_in_air17 9.230383e+01
13 mean_speed_on_paper17 2.514241e+01
14 num_of_pendown17 5.104453e+00
15 paper_time17 4.647678e+12
16 pressure_mean17 1.570213e+00
17 pressure_var17 1.806342e+00
18 total_time17 1.507986e+12
VIF for Task18:
Features VIF
0 const 7.116426e+01
1 air_time18 4.003200e+13
2 disp_index18 1.309099e+01
3 gmrt_in_air18 inf
4 gmrt_on_paper18 inf
5 max_x_extension18 5.855448e+00
6 max_y_extension18 4.270972e+00
7 mean_acc_in_air18 1.845349e+02
8 mean_acc_on_paper18 8.385470e+00
9 mean_gmrt18 inf
10 mean_jerk_in_air18 1.734955e+02
11 mean_jerk_on_paper18 5.107061e+00
12 mean_speed_in_air18 1.128941e+01
13 mean_speed_on_paper18 2.706164e+01
14 num_of_pendown18 1.867973e+01
15 paper_time18 2.309538e+14
16 pressure_mean18 1.400963e+00
17 pressure_var18 1.394921e+00
18 total_time18 6.928615e+13
VIF for Task19:
Features VIF
0 const 3.639880e+02
1 air_time19 inf
2 disp_index19 5.236016e+00
3 gmrt_in_air19 inf
4 gmrt_on_paper19 inf
5 max_x_extension19 1.962953e+00
6 max_y_extension19 1.640485e+00
7 mean_acc_in_air19 5.289859e+02
8 mean_acc_on_paper19 7.604564e+00
9 mean_gmrt19 inf
10 mean_jerk_in_air19 5.137661e+02
11 mean_jerk_on_paper19 1.019701e+01
12 mean_speed_in_air19 1.199893e+02
13 mean_speed_on_paper19 2.593118e+01
14 num_of_pendown19 4.254184e+00
15 paper_time19 1.324588e+14
16 pressure_mean19 1.895993e+00
17 pressure_var19 1.746712e+00
18 total_time19 inf
VIF for Task20:
Features VIF
0 const 1.682869e+02
1 air_time20 4.094181e+14
2 disp_index20 7.056759e+00
3 gmrt_in_air20 inf
4 gmrt_on_paper20 inf
5 max_x_extension20 3.484185e+00
6 max_y_extension20 2.816101e+00
7 mean_acc_in_air20 2.397273e+02
8 mean_acc_on_paper20 7.461168e+00
9 mean_gmrt20 inf
10 mean_jerk_in_air20 2.278487e+02
11 mean_jerk_on_paper20 7.019694e+00
12 mean_speed_in_air20 1.718689e+01
13 mean_speed_on_paper20 3.695920e+01
14 num_of_pendown20 5.265015e+00
15 paper_time20 2.251800e+14
16 pressure_mean20 1.698360e+00
17 pressure_var20 1.843513e+00
18 total_time20 1.233863e+14
VIF for Task21:
Features VIF
0 const 3.580011e+02
1 air_time21 6.721790e+13
2 disp_index21 1.274099e+01
3 gmrt_in_air21 inf
4 gmrt_on_paper21 inf
5 max_x_extension21 5.584915e+00
6 max_y_extension21 5.399551e+00
7 mean_acc_in_air21 2.848989e+03
8 mean_acc_on_paper21 1.029701e+01
9 mean_gmrt21 inf
10 mean_jerk_in_air21 2.741564e+03
11 mean_jerk_on_paper21 9.135640e+00
12 mean_speed_in_air21 1.159173e+01
13 mean_speed_on_paper21 5.017173e+01
14 num_of_pendown21 5.968672e+00
15 paper_time21 2.690322e+12
16 pressure_mean21 2.499224e+00
17 pressure_var21 3.732340e+00
18 total_time21 1.941207e+13
VIF for Task22:
Features VIF
0 const 3.204558e+02
1 air_time22 inf
2 disp_index22 5.860487e+00
3 gmrt_in_air22 inf
4 gmrt_on_paper22 inf
5 max_x_extension22 2.872050e+00
6 max_y_extension22 2.671489e+00
7 mean_acc_in_air22 7.719101e+01
8 mean_acc_on_paper22 6.582674e+00
9 mean_gmrt22 inf
10 mean_jerk_in_air22 7.629409e+01
11 mean_jerk_on_paper22 7.539692e+00
12 mean_speed_in_air22 2.115548e+01
13 mean_speed_on_paper22 3.407063e+01
14 num_of_pendown22 2.417024e+00
15 paper_time22 7.970973e+13
16 pressure_mean22 1.533381e+00
17 pressure_var22 1.530904e+00
18 total_time22 inf
VIF for Task23:
Features VIF
0 const 2.562240e+02
1 air_time23 8.188363e+14
2 disp_index23 3.804280e+00
3 gmrt_in_air23 inf
4 gmrt_on_paper23 inf
5 max_x_extension23 2.540148e+00
6 max_y_extension23 3.855874e+00
7 mean_acc_in_air23 7.356635e+01
8 mean_acc_on_paper23 7.190413e+00
9 mean_gmrt23 inf
10 mean_jerk_in_air23 7.130591e+01
11 mean_jerk_on_paper23 6.147819e+00
12 mean_speed_in_air23 2.879408e+01
13 mean_speed_on_paper23 3.586943e+01
14 num_of_pendown23 2.098031e+00
15 paper_time23 2.047091e+13
16 pressure_mean23 1.373074e+00
17 pressure_var23 1.249327e+00
18 total_time23 1.286743e+15
VIF for Task24:
Features VIF
0 const 1.599396e+02
1 air_time24 2.370316e+14
2 disp_index24 7.650013e+00
3 gmrt_in_air24 inf
4 gmrt_on_paper24 inf
5 max_x_extension24 1.127002e+01
6 max_y_extension24 9.385270e+00
7 mean_acc_in_air24 1.012693e+03
8 mean_acc_on_paper24 8.138784e+00
9 mean_gmrt24 inf
10 mean_jerk_in_air24 9.887079e+02
11 mean_jerk_on_paper24 1.079159e+01
12 mean_speed_in_air24 5.749085e+01
13 mean_speed_on_paper24 1.176921e+01
14 num_of_pendown24 2.766528e+00
15 paper_time24 5.629500e+14
16 pressure_mean24 1.595625e+00
17 pressure_var24 1.476198e+00
18 total_time24 5.088813e+13
VIF for Task25:
Features VIF
0 const 3.906111e+02
1 air_time25 3.002400e+15
2 disp_index25 8.061904e+00
3 gmrt_in_air25 inf
4 gmrt_on_paper25 inf
5 max_x_extension25 3.072491e+00
6 max_y_extension25 2.214602e+00
7 mean_acc_in_air25 1.586619e+03
8 mean_acc_on_paper25 7.287828e+00
9 mean_gmrt25 inf
10 mean_jerk_in_air25 1.578315e+03
11 mean_jerk_on_paper25 6.826671e+00
12 mean_speed_in_air25 7.074528e+01
13 mean_speed_on_paper25 2.947896e+01
14 num_of_pendown25 3.741947e+00
15 paper_time25 1.085205e+13
16 pressure_mean25 1.451165e+00
17 pressure_var25 1.856188e+00
18 total_time25 2.251800e+14
//anaconda3/lib/python3.11/site-packages/statsmodels/stats/outliers_influence.py:198: RuntimeWarning: divide by zero encountered in scalar divide
  vif = 1. / (1. - r_squared_i)
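The VIF tables show severe multicollinearity in several feature groups. For gmrt_in_air, gmrt_on_paper, and mean_gmrt the VIF is infinite in every task shown, which indicates perfect multicollinearity: R^2 equals 1 in VIF = 1 / (1 - R^2), hence the divide-by-zero warning. The time features (air_time, paper_time, total_time) have VIFs on the order of 1e12 to 1e15, or infinite, and mean_acc_in_air and mean_jerk_in_air are far above the common rule-of-thumb threshold of 10 in the tables above.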
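The infinite values can be reproduced on toy data. The sketch below is a NumPy-only illustration of the VIF definition, not the statsmodels call used for the tables; `vif_table` is a hypothetical helper, and the synthetic columns only borrow the dataset's feature names. It assumes that mean_gmrt is the average of the two other GMRT columns, which is what the infinite VIFs suggest.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF per column: regress each column on the others (plus an intercept)."""
    X = features.to_numpy(dtype=float)
    out = {}
    for i, name in enumerate(features.columns):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        # R^2 == 1 means the column is an exact linear combination of the others
        out[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")

# Synthetic data (assumption): mean_gmrt is the exact average of the other two
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "gmrt_in_air": rng.normal(200, 50, 100),
    "gmrt_on_paper": rng.normal(150, 40, 100),
    "pressure_mean": rng.normal(1500, 200, 100),
})
demo["mean_gmrt"] = (demo["gmrt_in_air"] + demo["gmrt_on_paper"]) / 2
vifs = vif_table(demo)
print(vifs)
```

The three GMRT columns come out with enormous or infinite VIFs, while the unrelated pressure column stays close to 1.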
The goal is to reduce multicollinearity while preserving as much useful information as possible. We'll focus on removing redundant features and potentially creating new features to capture the essence of the highly correlated variables.
Time-Related Features:
Keep: total_time for each task (captures overall task duration)
Remove: air_time and paper_time (nearly collinear with total_time; VIFs of 1e12 and above suggest that total_time is essentially air_time + paper_time)
Feature Engineering:
Air-to-Paper Ratio: air_time / paper_time (measures relative time in air vs. on paper)
GMRT Features:
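Primary GMRT Feature: Keep mean_gmrt for each task. GMRT is the generalized mean relative tremor, and mean_gmrt averages its in-air and on-paper values.
Redundant GMRT Features: Remove gmrt_in_air and gmrt_on_paper. Because mean_gmrt is computed from them, the three columns are linearly dependent, which is why their VIFs are infinite.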
Speed Features:
Keep: mean_speed_on_paper for each task (directly measures writing speed)
Remove: mean_speed_in_air (less directly related to handwriting quality)
Movement Smoothness Features:
Keep: mean_jerk_in_air and mean_jerk_on_paper (captures smoothness of movement)
Remove: mean_acc_in_air and mean_acc_on_paper (highly correlated with jerk features)
Pressure Features:
Remove: pressure_mean and pressure_var (raw mean and variance of pen pressure; replaced by the single ratio below)
Feature Engineering:
Pressure Variation Index: pressure_var / pressure_mean (measures pressure fluctuation)
Spatial Features:
Keep: disp_index (measures how much of the paper is used)
Remove: max_x_extension and max_y_extension (measure writing extent)
Feature Engineering:
Writing Area: max_x_extension * max_y_extension (captures the overall area covered)
Pendowns Number:
Keep: num_of_pendown (measures how many times the pen touches the paper)
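The plan above can be sketched as a single transformation over the per-task columns. This is a minimal illustration, not code from the notebook: `reduce_task_features` and the engineered column names (`air_paper_ratio`, `pressure_var_index`, `writing_area`) are hypothetical, and the toy values are made up. It assumes the `<feature><task number>` naming seen in the VIF tables.

```python
import pandas as pd

# Features kept unchanged for every task
KEEP = ["total_time", "mean_gmrt", "mean_speed_on_paper",
        "mean_jerk_in_air", "mean_jerk_on_paper",
        "disp_index", "num_of_pendown"]

def reduce_task_features(df, n_tasks=25, label_col="class"):
    """Apply the keep/remove/engineer plan to each task's columns."""
    cols = {}
    for t in range(1, n_tasks + 1):
        for base in KEEP:
            cols[f"{base}{t}"] = df[f"{base}{t}"]
        # Engineered features (names are assumptions)
        cols[f"air_paper_ratio{t}"] = df[f"air_time{t}"] / df[f"paper_time{t}"]
        cols[f"pressure_var_index{t}"] = df[f"pressure_var{t}"] / df[f"pressure_mean{t}"]
        cols[f"writing_area{t}"] = df[f"max_x_extension{t}"] * df[f"max_y_extension{t}"]
    out = pd.DataFrame(cols, index=df.index)
    if label_col in df.columns:
        out[label_col] = df[label_col]
    return out

# Toy frame with a single task and made-up values
toy = pd.DataFrame({
    "air_time1": [100.0, 300.0], "paper_time1": [200.0, 100.0],
    "total_time1": [300.0, 400.0], "mean_gmrt1": [150.0, 180.0],
    "mean_speed_on_paper1": [2.0, 3.0],
    "mean_jerk_in_air1": [0.05, 0.04], "mean_jerk_on_paper1": [0.02, 0.03],
    "disp_index1": [1e-5, 2e-5], "num_of_pendown1": [10, 12],
    "pressure_mean1": [1600.0, 1500.0], "pressure_var1": [400.0, 300.0],
    "max_x_extension1": [10.0, 20.0], "max_y_extension1": [20.0, 30.0],
    "class": ["P", "H"],
})
reduced = reduce_task_features(toy, n_tasks=1)
print(reduced.columns.tolist())
```

Each task goes from 18 raw features to 10. The two ratio features become infinite or NaN when the denominator is zero, so rows with zero paper_time or pressure_mean need handling before or after this step.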
# Drop every row in which any column equals zero.
# Zeros stored as the string "0.000000" are first converted to numeric 0.
df = df.replace("0.000000", 0)
# Keep only the rows where no column is 0
df = df[(df != 0).all(axis=1)]
# Count missing cells across the whole DataFrame
print(f"There are {df.isnull().sum().sum()} missing values.")
There are 0 missing values.
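import numpy as np
from scipy.stats import chi2

def mahalanobis_distance(data):
    """Return the Mahalanobis distance of each row from the column means."""
    data_array = data.to_numpy()
    mean_vector = np.mean(data_array, axis=0)
    cov_matrix = np.cov(data_array, rowvar=False)
    # Caveat: with more columns than rows the covariance matrix is singular,
    # so this inverse is numerically unreliable.
    cov_inv = np.linalg.inv(cov_matrix)
    diff = data_array - mean_vector
    mahalanobis_dist = np.sqrt(np.sum(np.dot(diff, cov_inv) * diff, axis=1))
    return mahalanobis_dist

def detect_outliers(data, threshold=0.95):
    """Return positional indices of rows whose distance exceeds the chi-square cutoff."""
    mahalanobis_dist = mahalanobis_distance(data)
    # Caveat: the chi-square quantile describes the squared distance,
    # while the value compared here is the unsquared distance.
    chi2_threshold = chi2.ppf(threshold, df=data.shape[1])
    outlier_indices = np.where(mahalanobis_dist > chi2_threshold)[0]
    return outlier_indices

# Exclude the last column (the class label) from the distance computation
outlier_indices = detect_outliers(df.iloc[:, :-1])
# Print indices of outlier data points
print("Indices of outlier data points:", outlier_indices)
print("Outlier percentages (%): ", len(outlier_indices) / len(df) * 100)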
Indices of outlier data points: []
Outlier percentages (%): 0.0
/var/folders/kt/rg6_d5c90v16qpfm3hltsfb80000gn/T/ipykernel_6608/3364540850.py:12: RuntimeWarning: invalid value encountered in sqrt
  mahalanobis_dist = np.sqrt(np.sum(np.dot(diff, cov_inv) * diff, axis=1))
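The sqrt warning means some quadratic forms came out negative, and NaN distances never exceed the threshold, so the empty result is not evidence that the data has no outliers. This fits a singular covariance matrix, since 450 features are estimated from 165 rows. The sketch below is one possible variant, not the notebook's method: it uses a pseudo-inverse and compares the squared distance with the chi-square quantile. The names `mahalanobis_sq` and `detect_outliers_sq` and the synthetic data are hypothetical. The distance is only informative when there are clearly fewer features than rows, so it is shown on a 3-column toy frame.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def mahalanobis_sq(data: pd.DataFrame) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the column means."""
    X = data.to_numpy(dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates ill-conditioning
    # Row-wise quadratic form diff_i^T * cov_inv * diff_i
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

def detect_outliers_sq(data: pd.DataFrame, quantile: float = 0.95) -> np.ndarray:
    """Positional indices of rows whose squared distance exceeds the chi-square cutoff."""
    d2 = mahalanobis_sq(data)
    cutoff = chi2.ppf(quantile, df=data.shape[1])
    return np.where(d2 > cutoff)[0]

# Synthetic data: 200 standard-normal rows plus one planted outlier in row 0
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
toy.iloc[0] = [10.0, -10.0, 10.0]
flagged = detect_outliers_sq(toy)
print(flagged)
```

On the full 450-column frame this variant would still be uninformative: with more features than rows, every row lies at the same pseudo-inverse distance from the mean. Running it on a reduced feature set or on a few principal components is a more reasonable use.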
df['class'] = df['class'].replace({'H': 0, 'P': 1})
print(df['class'].value_counts())
class
1    83
0    82
Name: count, dtype: int64
/var/folders/kt/rg6_d5c90v16qpfm3hltsfb80000gn/T/ipykernel_6608/3161964165.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df['class'] = df['class'].replace({'H': 0, 'P': 1})
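The FutureWarning arises because `replace` silently changes the column from object to integer dtype. One way to avoid it, sketched here on a stand-in Series and assuming the column contains only the labels 'H' and 'P', is to use `Series.map` with an explicit cast:

```python
import pandas as pd

# Stand-in for df['class']; the real column is assumed to hold only 'H' and 'P'
labels = pd.Series(["H", "P", "P", "H"], name="class")

# map() builds a new integer Series; any unexpected label would become NaN
# and make the astype() call fail loudly instead of passing through unnoticed
encoded = labels.map({"H": 0, "P": 1}).astype("int64")
print(encoded.value_counts())
```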
clean_df = df
clean_df
| | air_time1 | disp_index1 | gmrt_in_air1 | gmrt_on_paper1 | max_x_extension1 | max_y_extension1 | mean_acc_in_air1 | mean_acc_on_paper1 | mean_gmrt1 | mean_jerk_in_air1 | ... | mean_jerk_in_air25 | mean_jerk_on_paper25 | mean_speed_in_air25 | mean_speed_on_paper25 | num_of_pendown25 | paper_time25 | pressure_mean25 | pressure_var25 | total_time25 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5160 | 0.000013 | 120.804174 | 86.853334 | 957 | 6601 | 0.361800 | 0.217459 | 103.828754 | 0.051836 | ... | 0.141434 | 0.024471 | 5.596487 | 3.184589 | 71 | 40120 | 1749.278166 | 296102.7676 | 144605 | 1 |
| 1 | 51980 | 0.000016 | 115.318238 | 83.448681 | 1694 | 6998 | 0.272513 | 0.144880 | 99.383459 | 0.039827 | ... | 0.049663 | 0.018368 | 1.665973 | 0.950249 | 129 | 126700 | 1504.768272 | 278744.2850 | 298640 | 1 |
| 3 | 2130 | 0.000010 | 369.403342 | 183.193104 | 1756 | 8159 | 0.556879 | 0.164502 | 276.298223 | 0.090408 | ... | 0.113905 | 0.019860 | 4.206746 | 1.613522 | 123 | 67945 | 1465.843329 | 230184.7154 | 181220 | 1 |
| 4 | 2310 | 0.000007 | 257.997131 | 111.275889 | 987 | 4732 | 0.266077 | 0.145104 | 184.636510 | 0.037528 | ... | 0.121782 | 0.020872 | 3.319036 | 1.680629 | 92 | 37285 | 1841.702561 | 158290.0255 | 72575 | 1 |
| 5 | 1920 | 0.000011 | 199.764957 | 109.902254 | 1548 | 6260 | 0.212523 | 0.143013 | 154.833606 | 0.028369 | ... | 0.131135 | 0.018907 | 3.643543 | 1.667827 | 76 | 43790 | 1081.054579 | 152045.4446 | 74605 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 169 | 2930 | 0.000010 | 241.736477 | 176.115957 | 1839 | 6439 | 0.253347 | 0.174663 | 208.926217 | 0.032691 | ... | 0.119152 | 0.020909 | 4.508709 | 2.233198 | 96 | 44545 | 1798.923336 | 247448.3108 | 80335 | 0 |
| 170 | 2140 | 0.000009 | 274.728964 | 234.495802 | 2053 | 8487 | 0.225537 | 0.174920 | 254.612383 | 0.032059 | ... | 0.174495 | 0.017640 | 4.685573 | 2.806888 | 84 | 37560 | 1725.619941 | 160664.6464 | 345835 | 0 |
| 171 | 3830 | 0.000008 | 151.536989 | 171.104693 | 1287 | 7352 | 0.165480 | 0.161058 | 161.320841 | 0.022705 | ... | 0.114472 | 0.017194 | 3.493815 | 2.510601 | 88 | 51675 | 1915.573488 | 128727.1241 | 83445 | 0 |
| 172 | 1760 | 0.000008 | 289.518195 | 196.411138 | 1674 | 6946 | 0.518937 | 0.202613 | 242.964666 | 0.090686 | ... | 0.114472 | 0.017194 | 3.493815 | 2.510601 | 88 | 51675 | 1915.573488 | 128727.1241 | 83445 | 0 |
| 173 | 2875 | 0.000008 | 235.769350 | 178.208024 | 1838 | 6560 | 0.567311 | 0.147818 | 206.988687 | 0.099555 | ... | 0.114472 | 0.017194 | 3.493815 | 2.510601 | 88 | 51675 | 1915.573488 | 128727.1241 | 83445 | 0 |
165 rows × 451 columns
# Save the DataFrame 'clean_df' to a CSV file named 'clean_df.csv'
clean_df.to_csv('clean_df.csv', index=False)